{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import YouTubeVideo\n",
    "\n",
    "YouTubeVideo(id=\"3sJnTpeFXZ4\", width=\"100%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to get you familiar with graph ideas,\n",
    "I have deliberately chosen to steer away from\n",
    "the more pedantic matters\n",
    "of loading graph data to and from disk.\n",
    "That said, the following scenario will eventually happen,\n",
    "where a graph dataset lands on your lap,\n",
    "and you'll need to load it in memory \n",
    "and start analyzing it.\n",
    "\n",
    "Thus, we're going to go through graph I/O,\n",
    "specifically the APIs on how to convert\n",
    "graph data that comes to you\n",
    "into that magical NetworkX object `G`.\n",
    "\n",
    "Let's get going!\n",
    "\n",
    "## Graph Data as Tables\n",
    "\n",
    "Let's recall what we've learned in the introductory chapters.\n",
    "Graphs can be represented using two **sets**:\n",
    "\n",
    "- Node set\n",
    "- Edge set\n",
    "\n",
    "### Node set as tables\n",
    "\n",
    "Let's say we had a graph with 3 nodes in it: `A, B, C`.\n",
    "We could represent it in plain text, computer-readable format:\n",
    "\n",
    "```csv\n",
    "A\n",
    "B\n",
    "C\n",
    "```\n",
    "\n",
    "Suppose the nodes also had metadata.\n",
    "Then, we could tag on metadata as well:\n",
    "\n",
    "```csv\n",
    "A, circle, 5\n",
    "B, circle, 7\n",
    "C, square, 9\n",
    "```\n",
    "\n",
    "Does this look familiar to you?\n",
    "Yes, node sets can be stored in CSV format,\n",
    "with one of the columns being node ID,\n",
    "and the rest of the columns being metadata.\n",
    "\n",
    "### Edge set as tables\n",
    "\n",
    "If, between the nodes, we had 4 edges (this is a directed graph),\n",
    "we can also represent those edges in plain text, computer-readable format:\n",
    "\n",
    "```csv\n",
    "A, C\n",
    "B, C\n",
    "A, B\n",
    "C, A\n",
    "```\n",
    "\n",
    "And let's say we also had other metadata,\n",
    "we can represent it in the same CSV format:\n",
    "\n",
    "```csv\n",
    "A, C, red\n",
    "B, C, orange\n",
    "A, B, yellow\n",
    "C, A, green\n",
    "```\n",
    "\n",
    "If you've been in the data world for a while,\n",
    "this should not look foreign to you.\n",
    "Yes, edge sets can be stored in CSV format too!\n",
    "Two of the columns represent the nodes involved in an edge,\n",
    "and the rest of the columns represent the metadata.\n",
    "\n",
    "### Combined Representation\n",
    "\n",
    "In fact, one might also choose to combine\n",
    "the node set and edge set tables together in a merged format:\n",
    "\n",
    "```\n",
    "n1, n2, colour, shape1, num1, shape2, num2\n",
    "A,  C,  red,    circle, 5,    square, 9\n",
    "B,  C,  orange, circle, 7,    square, 9\n",
    "A,  B,  yellow, circle, 5,    circle, 7\n",
    "C,  A,  green,  square, 9,    circle, 5\n",
    "```\n",
    "\n",
    "In this chapter, the datasets that we will be looking at\n",
    "are going to be formatted in both ways.\n",
    "Let's get going."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "We will be working with the Divvy bike sharing dataset.\n",
    "\n",
    "> Divvy is a bike sharing service in Chicago.\n",
    "> Since 2013, Divvy has released their bike sharing dataset to the public.\n",
    "> The 2013 dataset is comprised of two files: \n",
    "> - `Divvy_Stations_2013.csv`, containing the stations in the system, and\n",
    "> - `DivvyTrips_2013.csv`, containing the trips.\n",
    "\n",
    "Let's dig into the data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyprojroot import here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Firstly, we need to unzip the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "import os\n",
    "from nams.load_data import datasets\n",
    "\n",
    "# This block of code checks to make sure that a particular directory is present.\n",
    "if \"divvy_2013\" not in os.listdir(datasets):\n",
    "    print(\"Unzipping the divvy_2013.zip file in the datasets folder.\")\n",
    "    with zipfile.ZipFile(datasets / \"divvy_2013.zip\", \"r\") as zip_ref:\n",
    "        zip_ref.extractall(datasets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's load in both tables.\n",
    "\n",
    "First is the `stations` table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "stations = pd.read_csv(\n",
    "    datasets / \"divvy_2013/Divvy_Stations_2013.csv\",\n",
    "    parse_dates=[\"online date\"],\n",
    "    encoding=\"utf-8\",\n",
    ")\n",
    "stations.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stations.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's load in the `trips` table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips = pd.read_csv(\n",
    "    datasets / \"divvy_2013/Divvy_Trips_2013.csv\", parse_dates=[\"starttime\", \"stoptime\"]\n",
    ")\n",
    "trips.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import janitor\n",
    "\n",
    "trips_summary = (\n",
    "    trips.groupby([\"from_station_id\", \"to_station_id\"])\n",
    "    .count()\n",
    "    .reset_index()\n",
    "    .select_columns([\"from_station_id\", \"to_station_id\", \"trip_id\"])\n",
    "    .rename_column(\"trip_id\", \"num_trips\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips_summary.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graph Model\n",
    "\n",
    "Given the data, if we wished to use a graph as a data model\n",
    "for the number of trips between stations,\n",
    "then naturally, nodes would be the stations,\n",
    "and edges would be trips between them.\n",
    "\n",
    "This graph would be directed,\n",
    "as one could have more trips from station A to B\n",
    "and less in the reverse.\n",
    "\n",
    "With this definition,\n",
    "we can begin graph construction!\n",
    "\n",
    "### Create NetworkX graph from pandas edgelist\n",
    "\n",
    "NetworkX provides an extremely convenient way\n",
    "to load data from a pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "\n",
    "G = nx.from_pandas_edgelist(\n",
    "    df=trips_summary,\n",
    "    source=\"from_station_id\",\n",
    "    target=\"to_station_id\",\n",
    "    edge_attr=[\"num_trips\"],\n",
    "    create_using=nx.DiGraph,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect the graph\n",
    "\n",
    "Once the graph is in memory,\n",
    "we can inspect it to get out summary graph statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll notice that the edge metadata have been added correctly: we have recorded in there the number of trips between stations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(G.edges(data=True))[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, the node metadata is not present:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(G.nodes(data=True))[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Annotate node metadata\n",
    "\n",
    "We have rich station data on hand,\n",
    "such as the longitude and latitude of each station,\n",
    "and it would be a pity to discard it,\n",
    "especially when we can potentially use it as part of the analysis\n",
    "or for visualization purposes.\n",
    "Let's see how we can add this information in.\n",
    "\n",
    "Firstly, recall what the `stations` dataframe looked like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stations.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `id` column gives us the node ID in the graph,\n",
    "so if we set `id` to be the index,\n",
    "if we then also loop over each row,\n",
    "we can treat the rest of the columns as dictionary keys\n",
    "and values as dictionary values,\n",
    "and add the information into the graph.\n",
    "\n",
    "Let's see this in action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for node, metadata in stations.set_index(\"id\").iterrows():\n",
    "    for key, val in metadata.items():\n",
    "        G.nodes[node][key] = val"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, our node metadata should be populated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(G.nodes(data=True))[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `nxviz`, a `GeoPlot` object is available\n",
    "that allows you to quickly visualize\n",
    "a graph that has geographic data.\n",
    "However, being `matplotlib`-based,\n",
    "it is going to be quickly overwhelmed\n",
    "by the sheer number of edges.\n",
    "\n",
    "As such, we are going to first filter the edges.\n",
    "\n",
    "### Exercise: Filter graph edges\n",
    "\n",
    "> Leveraging what you know about how to manipulate graphs,\n",
    "> now try _filtering_ edges.\n",
    ">\n",
    "\n",
    "_Hint: NetworkX graph objects can be deep-copied using `G.copy()`:_\n",
    "\n",
    "```python\n",
    "G_copy = G.copy()\n",
    "```\n",
    "\n",
    "_Hint: NetworkX graph objects also let you remove edges:_\n",
    "\n",
    "```python\n",
    "G.remove_edge(node1, node2)  # does not return anything\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def filter_graph(G, minimum_num_trips):\n",
    "    \"\"\"\n",
    "    Filter the graph such that\n",
    "    only edges that have minimum_num_trips or more\n",
    "    are present.\n",
    "    \"\"\"\n",
    "    G_filtered = G.____()\n",
    "    for _, _, _ in G._____(data=____):\n",
    "        if d[___________] < ___:\n",
    "            G_________.___________(_, _)\n",
    "    return G_filtered\n",
    "\n",
    "\n",
    "from nams.solutions.io import filter_graph\n",
    "\n",
    "G_filtered = filter_graph(G, 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize using GeoPlot\n",
    "\n",
    "`nxviz` provides a GeoPlot object\n",
    "that lets you quickly visualize geospatial graph data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on geospatial visualizations:\n",
    "\n",
    "> As the creator of `nxviz`,\n",
    "> I would recommend using proper geospatial packages\n",
    "> to build custom geospatial graph viz,\n",
    "> such as [`pysal`](http://pysal.org/).)\n",
    "> \n",
    "> That said, `nxviz` can probably do what you need\n",
    "> for a quick-and-dirty view of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nxviz as nv\n",
    "\n",
    "c = nv.geo(G_filtered, node_color_by=\"dpcapacity\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does that look familiar to you? Looks quite a bit like Chicago, I'd say :)\n",
    "\n",
    "Jesting aside, this visualization does help illustrate\n",
    "that the majority of trips occur between stations that are\n",
    "near the city center."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pickling Graphs\n",
    "\n",
    "Since NetworkX graphs are Python objects,\n",
    "the canonical way to save them is by pickling them.\n",
    "\n",
    "Here's an example in action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "with open(\"/tmp/divvy.pkl\", \"wb\") as f:\n",
    "    pickle.dump(G, f, pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And just to show that it can be loaded back into memory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"/tmp/divvy.pkl\", \"rb\") as f:\n",
    "    G_loaded = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise: checking graph integrity\n",
    "\n",
    "If you get a graph dataset as a pickle,\n",
    "you should always check it against reference properties\n",
    "to make sure of its data integrity.\n",
    "\n",
    "> Write a function that tests that the graph\n",
    "> has the correct number of nodes and edges inside it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_graph_integrity(G):\n",
    "    \"\"\"Test integrity of raw Divvy graph.\"\"\"\n",
    "    # Your solution here\n",
    "    pass\n",
    "\n",
    "\n",
    "from nams.solutions.io import test_graph_integrity\n",
    "\n",
    "test_graph_integrity(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other text formats\n",
    "\n",
    "CSV files and `pandas` DataFrames\n",
    "give us a convenient way to store graph data,\n",
    "and if possible, do insist with your data collaborators\n",
    "that they provide you with graph data that are in this format.\n",
    "If they don't, however, no sweat!\n",
    "After all, Python is super versatile.\n",
    "\n",
    "In this ebook, we have loaded data in\n",
    "from non-CSV sources,\n",
    "sometimes by parsing text files raw,\n",
    "sometimes by treating special characters as delimiters in a CSV-like file,\n",
    "and sometimes by resorting to parsing JSON.\n",
    "\n",
    "You can see other examples of how we load data\n",
    "by browsing through the source file of `load_data.py`\n",
    "and studying how we construct graph objects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solutions\n",
    "\n",
    "The solutions to this chapter's exercises are below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from nams.solutions import io\n",
    "import inspect\n",
    "\n",
    "print(inspect.getsource(io))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "nams",
   "language": "python",
   "name": "nams"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
